border point
Oversampling and Downsampling with Core-Boundary Awareness: A Data Quality-Driven Approach
Belhaouari, Samir Brahim, Kahalan, Yunis Carreon, Shaffique, Humaira, Belhaouari, Ismael, Islam, Ashhadul
The effectiveness of machine learning models, particularly in unbalanced classification tasks, is often hindered by the failure to differentiate between critical instances near the decision boundary and redundant samples concentrated in the core of the data distribution. In this paper, we propose a method to systematically identify and differentiate between these two types of data. Through extensive experiments on multiple benchmark datasets, we show that the boundary data oversampling method improves the F1 score by up to 10\% on 96\% of the datasets, whereas our core-aware reduction method compresses datasets up to 90\% while preserving their accuracy, making it 10 times more powerful than the original dataset. Beyond imbalanced classification, our method has broader implications for efficient model training, particularly in computationally expensive domains such as Large Language Model (LLM) training. By prioritizing high-quality, decision-relevant data, our approach can be extended to text, multimodal, and self-supervised learning scenarios, offering a pathway to faster convergence, improved generalization, and significant computational savings. This work paves the way for future research in data-efficient learning, where intelligent sampling replaces brute-force expansion, driving the next generation of AI advancements. Our code is available as a Python package at https://pypi.org/project/adaptive-resampling/ .
- Europe > Netherlands > Limburg > Maastricht (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- (2 more...)
Fast likelihood-based change point detection
Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single bernoulli variable while the second model divides the stored data in two segments, and models each segment with its own bernoulli variable. Finding the optimal split can be done in $O(n)$ time, where $n$ is the number of entries since the last change point. This is too expensive for large $n$. To combat this we propose an approximation scheme that yields $(1 - \epsilon)$ approximation in $O(\epsilon^{-1} \log^2 n)$ time. The speed-up consists of several steps: First we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size $O(\epsilon^{-1} \log n)$ for model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Finland > Uusimaa > Helsinki (0.04)
Deep Density-based Image Clustering
Ren, Yazhou, Wang, Ni, Li, Mingxia, Xu, Zenglin
Recently, deep clustering, which is able to perform feature learning that favors clustering tasks via deep neural networks, has achieved remarkable performance in image clustering applications. However, the existing deep clustering algorithms generally need the number of clusters in advance, which is usually unknown in real-world tasks. In addition, the initial cluster centers in the learned feature space are generated by $k$-means. This only works well on spherical clusters and probably leads to unstable clustering results. In this paper, we propose a two-stage deep density-based image clustering (DDC) framework to address these issues. The first stage is to train a deep convolutional autoencoder (CAE) to extract low-dimensional feature representations from high-dimensional image data, and then apply t-SNE to further reduce the data to a 2-dimensional space favoring density-based clustering algorithms. The second stage is to apply the developed density-based clustering technique on the 2-dimensional embedded data to automatically recognize an appropriate number of clusters with arbitrary shapes. Concretely, a number of local clusters are generated to capture the local structures of clusters, and then are merged via their density relationship to form the final clustering result. Experiments demonstrate that the proposed DDC achieves comparable or even better clustering performance than state-of-the-art deep clustering methods, even though the number of clusters is not given.
- North America > United States > California (0.04)
- Europe > Finland > North Karelia > Joensuu (0.04)
- Asia > China > Sichuan Province > Chengdu (0.04)
The EU plans to test an AI lie detector at border points
Trials for AI lie detection at border patrol checkpoints are set to begin soon in the EU. The program, called iBorderCtrl, will run for six months at four border crossing points in Hungary, Latvia and Greece with countries outside the European Union, as reported by Gizmodo. The system has users fill out an online application and upload some documents, like their passport, before a virtual border guard takes over to ask questions. According to New Scientist, some of these questions include "What's in your suitcase?" If iBorderCtrl determines the traveler is telling the truth, then they receive a QR code that will let them pass the border.
- Government > Regional Government (0.73)
- Government > Immigration & Customs (0.73)
Efficient Motion Planning for Problems Lacking Optimal Substructure
Salzman, Oren (Carnegie Mellon University) | Hou, Brian (Carnegie Mellon University) | Srinivasa, Siddhartha (Carnegie Mellon University)
We consider the motion-planning problem of planning a collision-free path of a robot in the presence of risk zones. The robot is allowed to travel in these zones but is penalized in a super-linear fashion for consecutive accumulative time spent there. We suggest a natural cost function that balances path length and risk-exposure time. Specifically, we consider the discrete setting where we are given a graph, or a roadmap, and we wish to compute the minimal-cost path under this cost function. Interestingly, paths defined using our cost function do not have an optimal substructure. Namely, subpaths of an optimal path are not necessarily optimal. Thus, the Bellman condition is not satisfied and standard graph-search algorithms such as Dijkstra cannot be used. We present a path-finding algorithm, which can be seen as a natural generalization of Dijkstra’s algorithm. Our algorithm runs in O ((n B · n) log(n B · n) + n B · m) time, where n and m are the number of vertices and edges of the graph, respectively, and n B is the number of intersections between edges and the boundary of the risk zone. We present simulations on robotic platforms demonstrating both the natural paths produced by our cost function and the computational efficiency of our algorithm.
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.05)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)